(54) Methods for polymorphism identification and profiling



(57) The invention provides methods of using probe
arrays for polymorphism identification and profiling.
Such methods entail constructing a first array of probes
that span and are complementary to one or more known
DNA sequences. This array is hybridized with nucleic
acid samples from different individuals to identify a
collection of polymorphisms. A second array is then
constructed to determine a polymorphic profile of an
individual at the collection of polymorphic sites. The
polymorphic profile is useful for, e.g., genetic mapping,
epidemiology, diagnosis and forensics.
Description

TECHNICAL FIELD 

[0001] The invention resides in the technical fields of molecular genetics, medicine and forensics.
BACKGROUND 

[0002] The genomes of all organisms undergo spontaneous mutation in the course of their continuing evolution,
generating variant forms of progenitor sequences (Gusella, [...]). The variant form may confer an evolutionary
advantage or disadvantage relative to a progenitor form, or may be neutral. In some instances, a variant form
confers a lethal disadvantage and is not transmitted to subsequent generations of the organism. In other
instances, a variant form confers an evolutionary advantage to the species and is eventually incorporated into
the DNA of many or most members of the species and effectively becomes the progenitor form. In many instances,
both progenitor and variant form(s) survive and co-exist in a species population. The coexistence of multiple forms of
a sequence gives rise to polymorphisms.

Several different types of polymorphism have been reported. A restriction fragment length polymorphism
(RFLP) means a variation in DNA sequence that alters the length of a restriction fragment, as described in Botstein et al.,
Am. J. Hum. Genet. 32, 314-331 (1980). Other polymorphisms take the form of short tandem repeats (STRs) that
include tandem di-, tri- and tetranucleotide repeated motifs. Some polymorphisms take the form of single nucleotide
variations between individuals of the same species. Such polymorphisms are far more frequent than RFLPs, STRs
and VNTRs. Single nucleotide polymorphisms can occur anywhere in protein-coding sequences, intronic sequences,
[...]

[...] phenotypic effect. Some polymorphisms, principally those occurring
within coding sequences, are known to be the direct cause of serious genetic diseases, such as sickle cell
[...]
expression without necessarily affecting the nature of the expression product. [...] described in WO 95/11995
3' base than its predecessor), the loss of hybridization intensity is manifested as a "footprint" of probes approximately
centered about the point of variation between the target sequence and reference sequence.

SUMMARY OF THE CLAIMED INVENTION 

[0009] In one aspect, the invention provides methods of polymorphism analysis. Such methods entail constructing
a first array of probes for polymorphism identification. The probes in such an array span and are complementary to
one or more known DNA sequences. The first array of probes is then hybridized with nucleic acid samples from different
individuals. Differences in the hybridization pattern of the samples to the probes among the different individuals indicate
the location of one or more polymorphic sites in the one or more DNA sequences. The above steps are repeated, as
needed, until a collection of polymorphic sites in known DNA sequences has been identified. A second array of probes
is then constructed for polymorphism profiling. The second array comprises a first set of probes spanning each of the
polymorphic sites in the collection and complementary to polymorphic forms present in the known sequences, and a
second set of probes spanning each of the polymorphic sites in the collection and complementary to polymorphic forms
absent in the known DNA sequences. The second array of probes is hybridized to a nucleic acid sample from a further
individual. The hybridization intensities of probes in the first and second sets of probes are analyzed to determine a
profile of polymorphic forms present in the further individual. In some methods, the further individual has a known
characteristic whose presence or absence is unknown in the different individuals used in the prior polymorphism
analysis. This characteristic can be, for example, the presence or absence of a disease or of being suspected of
perpetrating a crime.

[0010] Some methods further comprise retrieving known DNA sequences from a computer database for use in
the polymorphism identification steps. The probes in the first array are selected to be complementary to the known
sequences. Often, the known sequences used for polymorphism identification are of unknown function. In some
methods, the sequences used for polymorphism identification are expressed sequence tags. In some methods, at least 100
known sequences are used for polymorphism identification. In some methods, at least some of the known sequences
are unlinked; for example, known sequences can occur on at least 2 chromosomes, or on each of the 23 human
chromosomes.

[0011] In some methods, the further nucleic acid sample is RNA or cDNA. In such methods, the hybridization
intensities of probes in the second array can be used to identify a subset of polymorphic sites in a subset of known sequences
which are expressed in the further nucleic acid sample. Profiles of the subset of polymorphic sites can readily be
obtained from an RNA sample without amplification.

[0012] In another aspect, the invention provides methods of determining whether discrepancies in published
sequences represent true genetic variation or are sequencing errors. Such methods entail retrieving multiple versions of a
nucleic acid sequence from a published source. The multiple versions are compared to identify point(s) of diversion. An
array of probes is then designed that span and are complementary to part(s) of the nucleic acid sequence spanning the
point(s) of diversion. Nucleic acid samples from multiple individuals are then hybridized to the array. The existence of
difference(s) in the hybridization intensity of probes spanning a point of diversion among the individuals indicates a
polymorphism at the point of diversion, and lack of difference in hybridization intensity of probes spanning a point of
diversion among the individuals indicates the point of diversion was due to a sequencing error. In some methods, the
published source of sequences is a computer database. In some methods, the multiple versions of the nucleic acid
sequence are retrieved as trace profiles.
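
The comparison of multiple published versions described above can be sketched as follows. This is only an illustrative sketch: it assumes the versions have already been aligned to equal length, and the function names and the `tolerance` cutoff on intensity spread are hypothetical choices, not values taken from the specification.

```python
def diversion_points(versions):
    """Return 0-based positions at which pre-aligned, equal-length versions disagree."""
    length = len(versions[0])
    if any(len(v) != length for v in versions):
        raise ValueError("versions must be pre-aligned to equal length")
    return [i for i in range(length) if len({v[i] for v in versions}) > 1]


def classify_point(intensities_by_individual, tolerance=0.2):
    """Label a point of diversion from one normalized probe intensity per individual.

    `tolerance` is an assumed cutoff: a spread larger than it is treated as a
    difference among individuals (polymorphism), otherwise as a sequencing error.
    """
    spread = max(intensities_by_individual) - min(intensities_by_individual)
    return "polymorphism" if spread > tolerance else "sequencing error"


points = diversion_points(["ACGT", "ACCT", "ACGT"])
label = classify_point([1.0, 0.4, 0.95])
```
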

[0013] In another aspect, the invention provides methods of polymorphism profiling. Such methods entail providing
an array of immobilized probes. The array comprises a first set of probes spanning each of a collection of polymorphic
sites in known sequences and complementary to first allelic forms of the sites, and a second set of probes spanning
each of the polymorphic sites in the collection and complementary to second allelic forms of the sites. The collection
of polymorphic sites includes at least 10 unlinked polymorphic sites. A nucleic acid sample from an individual is
hybridized to the array of probes and the hybridization intensities of probes in the first and second probe sets are analyzed
to determine a profile of polymorphic forms present in the individual. In some such methods, the collection of
polymorphic sites includes a polymorphic site on each of the 23 human chromosomes. In some methods, the collection of
polymorphic sites includes at least 100 polymorphic sites.

[0014] In some methods, the hybridizing step is repeated for nucleic acid samples from a population of individuals,
each of whom is characterized for the presence or absence of a phenotype, to determine a profile of polymorphic forms
present in each individual in the population, and the method further comprises correlating profiles of polymorphic forms
with the presence or absence of the phenotype in the population.

[0015] In some methods, the nucleic acid sample is RNA or cDNA. In such methods, optionally, one can enrich for
transcripts of the known sequences in the nucleic acid sample by contacting the nucleic acid sample with probes
complementary to the transcripts, whereby the probes hybridize to the transcripts to form complexes, isolating the
complexes and dissociating the transcripts of the known sequences from the probes.
DEFINITIONS

[0016] A nucleic acid is a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form,
including known analogs of natural nucleotides unless otherwise indicated.

[0017] An oligonucleotide is a single-stranded nucleic acid ranging in length from 2 to about 500 bases. 
[0018] A probe is an oligonucleotide capable of binding to a target nucleic acid of complementary sequence through
one or more types of chemical bonds, usually through complementary base pairing, usually through hydrogen bond
formation. An oligonucleotide probe may include natural (i.e., A, G, C, or T) or modified bases (e.g., 7-deazaguanosine,
inosine). In addition, the bases in an oligonucleotide probe may be joined by a linkage other than a phosphodiester bond,
so long as it does not interfere with hybridization. Thus, oligonucleotide probes may be peptide nucleic acids in which
the constituent bases are joined by peptide bonds rather than phosphodiester linkages.

[0019] Specific hybridization refers to the binding, duplexing, or hybridizing of a molecule only to a particular
nucleotide sequence under stringent conditions when that sequence is present in a complex mixture (e.g., total cellular)
DNA or RNA. Stringent conditions are conditions under which a probe will hybridize to its target subsequence, but to
no other sequences. Stringent conditions are sequence-dependent and are different in different circumstances. Longer
sequences hybridize specifically at higher temperatures. Generally, stringent conditions are selected to be about 5°C
lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the
temperature (under defined ionic strength, pH, and nucleic acid concentration) at which 50% of the probes
complementary to the target sequence hybridize to the target sequence at equilibrium. (As the target sequences are generally
present in excess, at Tm, 50% of the probes are occupied at equilibrium.) Typically, stringent conditions include a salt
concentration of at least about 0.01 to 1.0 M Na ion concentration (or other salts) at pH 7.0 to 8.3 and a temperature
of at least about 30°C for short probes (e.g., 10 to 50 nucleotides). Stringent conditions can also be achieved with the
addition of destabilizing agents such as formamide. For example, conditions of 5X SSPE (750 mM NaCl, 50 mM Na
Phosphate, 5 mM EDTA, pH 7.4) and a temperature of 25-30°C are suitable for allele-specific probe hybridizations.
[0020] A perfectly matched probe has a sequence perfectly complementary to a particular target sequence. The test
probe is typically perfectly complementary to a portion (subsequence) of the target sequence. The term "mismatch
probe" refers to probes whose sequence is deliberately selected not to be perfectly complementary to a particular target
sequence. Although the mismatch(es) may be located anywhere in the mismatch probe, terminal mismatches are less
desirable as a terminal mismatch is less likely to prevent hybridization of the target sequence. Thus, probes are often
designed to have the mismatch located at or near the center of the probe such that the mismatch is most likely to
destabilize the duplex with the target sequence under the test hybridization conditions.

[0021] A polymorphic marker or site is the locus at which divergence occurs. Preferred markers have at least two
alleles, each occurring at a frequency of greater than 1%, and more preferably greater than 10% or 20% of a selected
population. A polymorphic locus may be as small as one base pair. Polymorphic markers include restriction fragment
length polymorphisms, variable number of tandem repeats (VNTRs), hypervariable regions, minisatellites, dinucleotide
repeats, trinucleotide repeats, tetranucleotide repeats, simple sequence repeats, and insertion elements such as Alu.
The first identified allelic form is arbitrarily designated as the reference form and other allelic forms are designated
as alternative or variant alleles. The allelic form occurring most frequently in a selected population is sometimes referred
to as the wildtype form. Diploid organisms may be homozygous or heterozygous for allelic forms. A diallelic
polymorphism has two forms. A triallelic polymorphism has three forms.
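
The frequency criterion for preferred markers can be illustrated with a short sketch. The function names and the representation of a population as a flat list of observed alleles are assumptions made for illustration only.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Map each observed allele to its fraction of all observations."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_preferred_marker(alleles, min_freq=0.01):
    """True if at least two alleles each occur at a frequency greater than min_freq."""
    freqs = allele_frequencies(alleles)
    return sum(1 for f in freqs.values() if f > min_freq) >= 2


# Hypothetical population sample: 95 observations of allele A, 5 of allele G.
sample = ["A"] * 95 + ["G"] * 5
passes_1_percent = is_preferred_marker(sample, min_freq=0.01)
passes_10_percent = is_preferred_marker(sample, min_freq=0.10)
```

With this sample the site meets the 1% criterion but not the more stringent 10% criterion.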

[0022] A single nucleotide polymorphism (SNP) occurs at a polymorphic site occupied by a single nucleotide, which
is the site of variation between allelic sequences. The site is usually preceded by and followed by highly conserved
sequences of the allele (e.g., sequences that vary in less than 1/100 or 1/1000 members of the populations).
[0023] A single nucleotide polymorphism usually arises due to substitution of one nucleotide for another at the
polymorphic site. A transition is the replacement of one purine by another purine or one pyrimidine by another pyrimidine.
A transversion is the replacement of a purine by a pyrimidine or vice versa. Single nucleotide polymorphisms can also
arise from a deletion of a nucleotide or an insertion of a nucleotide relative to a reference allele.
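
The distinction between transitions and transversions can be expressed as a small sketch. The function name is hypothetical and only DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and variant bases are identical")
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


kind = classify_substitution("A", "G")
```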

DETAILED DISCLOSURE 

1. General

[0024] The invention employs arrays of oligonucleotide probes for de novo identification of polymorphisms and for
the use of such polymorphisms in determining a polymorphic profile of an individual. De novo identification of
polymorphisms starts with a nucleic acid fragment whose sequence is known, which is designated a reference sequence. The
reference sequence can be obtained from a computer database or from the published literature or can be determined
by any conventional means. A probe array is constructed containing probes spanning and complementary to the
reference sequence, or any segment of interest thereof. The array of probes is hybridized to nucleic acid samples from
a collection of individuals. If different allelic forms of the reference sequence are present in the collection of individuals,
the array of probes shows a different hybridization pattern to different samples. The hybridization patterns can be
interpreted to reveal the location of a polymorphic site in the reference sequence, and in some instances, the
polymorphic forms present at this site.

[0025] The more individuals that are screened, the more likely it is that a polymorphic site is identified. About ten
individuals are sufficient to identify most polymorphic sites. Typically the individuals are unrelated. The individuals need
not be from any geographic, religious or ethnic subclass. Indeed, selecting individuals from different subclasses can
increase the probability of identifying a polymorphic site. In most instances, the individuals are humans, but plants and
animals can also be used. Typically, the individuals have not been characterized for the presence or absence of a
selected trait, such as a particular disease.

[0026] The above identification process can be carried out on a large scale. For example, the same support can
have attached multiple subsets of probes that span and are complementary to multiple reference sequences. The
reference sequences need not be related; for example, they can be from different chromosomes. The hybridization
pattern of each subset of probes to nucleic acid samples from different individuals can be interpreted independently
to determine the existence and nature of polymorphic sites in each of the reference sequences. Alternatively, or
additionally, subsets of probes spanning and complementary to different reference sequences can be immobilized on
separate supports, and the supports individually hybridized to samples from different individuals. Ultimately, a collection
of polymorphisms in a collection of reference sequences is identified. A collection of several thousand polymorphisms
identified by this approach is described in commonly owned applications USSN 08/813,159, 60/042,125, 60/050,594.
[0027] A secondary array is then constructed to use the previously identified polymorphisms for polymorphic profiling.
The secondary array includes a first group of probes, which span polymorphic sites and flanking bases, and are
complementary to the reference sequences. The secondary array includes a second group of probes, which also span the
polymorphic sites and flanking bases, but which are designed to be complementary to allelic variant forms of the
reference sequences. The secondary array typically includes probes spanning a large collection of polymorphic sites.
The secondary array is hybridized to a nucleic acid sample from a further individual. Analysis of
the hybridization pattern indicates which allelic form is present at the polymorphic sites included in the secondary array,
thereby developing a polymorphic profile of the individual.

[0028] The polymorphic profile can be used in association studies. That is, by determining polymorphic profiles
in a population of individuals, each of whom has been characterized for the presence or absence of a phenotypic trait,
one can determine which polymorphic forms, alone or in combination, are correlated with the trait. Alternatively, once
a correlation of traits with polymorphic forms has been performed, determination of a polymorphic profile in an individual
can be used to predict susceptibility to traits without direct phenotypic testing of the individual. Polymorphic profiles
are also useful in forensics and paternity testing.
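
A minimal sketch of the tabulation underlying such an association study follows. It assumes, purely for illustration, that each polymorphic profile is a dictionary mapping a site name to a genotype string and that the phenotype is recorded as a boolean; the function names and data layout are hypothetical, and no statistical test is implied by the specification.

```python
def contingency_table(profiles, phenotypes, site, allele):
    """Count individuals by (carries allele, has phenotype)."""
    table = {(c, p): 0 for c in (True, False) for p in (True, False)}
    for profile, affected in zip(profiles, phenotypes):
        carries = allele in profile[site]
        table[(carries, bool(affected))] += 1
    return table


def carrier_rates(table):
    """Return the fraction of allele carriers among affected and unaffected individuals."""
    affected = table[(True, True)] + table[(False, True)]
    unaffected = table[(True, False)] + table[(False, False)]
    rate_affected = table[(True, True)] / affected if affected else float("nan")
    rate_unaffected = table[(True, False)] / unaffected if unaffected else float("nan")
    return rate_affected, rate_unaffected


# Hypothetical profiles at one site for four individuals.
profiles = [{"site1": "AG"}, {"site1": "GG"}, {"site1": "AA"}, {"site1": "AA"}]
phenotypes = [True, True, False, False]
table = contingency_table(profiles, phenotypes, "site1", "G")
rates = carrier_rates(table)
```

A marked difference between the two carrier rates suggests a correlation that would then need to be assessed with an appropriate statistical test on a realistically sized population.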

2. Reference Sequences

[0029] Reference sequences for polymorphic site identification are often obtained from computer databases such
as Genbank, the Stanford Genome Center, The Institute for Genome Research and the Whitehead Institute. The latter
databases are available at http://www-genome.wi.mit.edu; http://shgc.stanford.edu and http://ww.tigr.org. A reference
[...] Reference sequences are typically of the order
of 100-1000 bases. The reference sequence can be from expressed or nonexpressed regions of the genome. In some
methods, in which RNA samples are used, highly expressed reference sequences are sometimes preferred to avoid
the need for RNA amplification. The reference sequences can be from proximate areas of the genome or can be
unrelated. Preferably, diverse reference sequences are analyzed to identify a correspondingly diverse collection of
polymorphisms. For example, the reference sequences can come from two or more chromosomes of the human
genome. In some methods, reference sequences from each of the 23 human chromosomes are present. In some instances
the function of a reference sequence is known, but more commonly reference sequences are of unknown function.
Reference sequences can also be from episomes such as mitochondrial DNA.

3. Nucleic Acid Sample Preparation

[0030] The nucleic acid samples hybridized to arrays can be genomic DNA, RNA or cDNA. Genomic DNA samples are
usually subject to amplification before application to an array. An individual genomic DNA segment from the same
genomic location as a designated reference sequence can be amplified by using primers flanking the reference
sequence. Multiple genomic segments corresponding to multiple reference sequences can be prepared by multiplex
amplification including primer pairs flanking each reference sequence in the amplification mix. Alternatively, the entire
genome can be amplified using random primers (typically hexamers) (see Barrett et al., Nucleic Acids Research 23,
3488-3492 (1995)) or by fragmentation and reassembly (see, e.g., Stemmer et al., Gene 164, 49-53 (1995)). Genomic
DNA can be obtained from virtually any tissue source (other than pure red blood cells). For example, convenient tissue
samples include whole blood, semen, saliva, tears, urine, fecal material, sweat, buccal, skin and hair.
[0031] RNA samples are also often subject to amplification. In this case amplification is typically preceded by reverse
transcription. Amplification of all expressed mRNA can be performed as described by commonly owned WO 96/14839
and WO 97/01603. Selective amplification of transcripts containing polymorphic sites represented on a chip can be
achieved by the same approach as for genomic DNA. Selected transcripts containing the polymorphic sites tiled on a
substrate can also be enriched by prehybridizing with biotin-labelled oligonucleotides complementary to such
transcripts. After separation of unbound oligonucleotides, hybridization complexes can be separated by affinity
chromatography to streptavidin-coated magnetic beads. If RNA species are in molar excess relative to corresponding labelled
oligonucleotides, and different oligonucleotides are used in equimolar amounts, the prehybridization step also has the
effect of equalizing the concentrations of the RNA species for which probes are included in the array.
[0032] In some methods, in which arrays are designed to tile polymorphic sites occurring in highly expressed
sequences, amplification of RNA is unnecessary. A species occurring at a relative abundance of 1:30,000 in RNA of total
concentration of 0.1 mg/ml can be detected. The choice of tissue from which the sample is obtained affects the relative
and absolute levels of different RNA transcripts in the sample. For example, cytochromes P450 are expressed at high
levels in the liver.

4. Methods of amplification 



[0033] The PCR method of amplification is described in PCR Technology: Principles and Applications for DNA
Amplification (ed. H.A. Erlich, Freeman Press, NY, NY, 1992); PCR Protocols: A Guide to Methods and Applications
(eds. Innis, et al., Academic Press, San Diego, CA, 1990); Mattila et al., Nucleic Acids Res. 19, 4967 (1991); Eckert et al.,
PCR Methods and Applications 1, 17 (1991); PCR (eds. McPherson et al., IRL Press, Oxford); and US Patent
4,683,202 (each of which is incorporated by reference for all purposes). Nucleic acids in a target sample are usually
labelled in the course of amplification by inclusion of one or more labelled nucleotides in the amplification mix. Labels
can also be attached to amplification products after amplification, e.g., by end-labelling. The amplification product can
be RNA or DNA depending on the enzyme and substrates used in the amplification reaction.
[0034] Other suitable amplification methods include the ligase chain reaction (LCR) (see Wu and Wallace, Genomics
4, 560 (1989), Landegren et al., Science 241, 1077 (1988)), transcription amplification (Kwoh et al., Proc. Natl. Acad.
Sci. USA 86, 1173 (1989)), and self-sustained sequence replication (Guatelli et al., Proc. Nat. Acad. Sci. USA 87,
1874 (1990)) and nucleic acid based sequence amplification (NASBA). The latter two amplification methods involve
isothermal reactions based on isothermal transcription, which produce both single stranded RNA (ssRNA) and double
stranded DNA (dsDNA) as the amplification products in a ratio of about 30 or 100 to 1, respectively.

5. Probe Arrays

[0035] The primary arrays of probes contain at least a first set of probes that tiles one or more reference sequences
(or regions of interest therein). Tiling means that the probe set contains overlapping probes which are complementary
to and span a region of interest in the reference allele. For example, a probe set might contain a ladder of probes each
of which differs from its predecessor in the omission of a 5' base and the acquisition of an additional 3' base. The
probes in a probe set may or may not be the same length. The number of probes can vary widely from about 6, 10,
20, 50, 100, 1000, to 10,000 or 100,000.
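
The ladder of overlapping probes can be sketched in a few lines. This is an illustrative sketch only: it assumes equal-length probes, a step of one base between successive windows of the reference, and probes written 5' to 3' as the reverse complement of each window; the function names and default probe length are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_probes(reference, probe_length=20, step=1):
    """Return probes complementary to successive overlapping windows of the reference."""
    if probe_length > len(reference):
        raise ValueError("probe_length exceeds reference length")
    return [
        reverse_complement(reference[i:i + probe_length])
        for i in range(0, len(reference) - probe_length + 1, step)
    ]


# A six-base reference tiled with four-base probes yields three overlapping probes.
probes = tile_probes("ATGCCA", probe_length=4)
```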

[0036] Such an array is hybridized to target samples from individuals under test and/or to a control sample known
to contain the reference sequence(s) tiled by the array. Optionally, the array can be hybridized simultaneously to more
than one target sample or to a target sample and reference sequence by use of two-color labelling (e.g., the reference
sequence bears one label and a target sample bears a second label). If the array is hybridized to a control reference
sequence (or a target sequence that is identical to the reference sequence), all probes in the first probe set specifically
hybridize to the reference sequence. If the array is hybridized to a target sample containing a target sequence that
differs from the reference sequence at a polymorphic site, then probes flanking the polymorphic site do not show
specific hybridization, whereas other probes in the first probe set distal to the polymorphic site do show specific
hybridization. The existence of a polymorphism is also manifested by differences in normalized hybridization intensities
of probes flanking the polymorphism when the probes are hybridized to corresponding targets from different individuals.
For example, relative loss of hybridization intensity in a "footprint" of probes flanking a polymorphism signals a difference
between the target and reference (i.e., a polymorphism) (see EP 717,113, incorporated by reference in its entirety for
all purposes). Additionally, hybridization intensities for corresponding targets from different individuals can be classified
into groups or clusters suggested by the data, not defined a priori, such that isolates in a given cluster tend to be similar
and isolates in different clusters tend to be dissimilar. See WO 97/29212 (incorporated by reference in its entirety for
all purposes).
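
A footprint of reduced normalized intensity can be located with a simple scan. The sketch below assumes that the input is a list of target/reference intensity ratios ordered along the tiling; the `threshold` and `min_width` values and the function names are hypothetical and are not taken from the cited references.

```python
def find_footprints(ratios, threshold=0.5, min_width=2):
    """Return (start, end) index pairs of runs of consecutive probes below threshold."""
    footprints = []
    start = None
    # A trailing sentinel above any threshold closes a run that reaches the last probe.
    for i, ratio in enumerate(list(ratios) + [float("inf")]):
        if ratio < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_width:
                footprints.append((start, i - 1))
            start = None
    return footprints


def footprint_center(footprint):
    """Return the probe index at the middle of a footprint, the approximate site of variation."""
    start, end = footprint
    return (start + end) // 2


ratios = [1.0, 0.9, 0.4, 0.2, 0.3, 0.95, 1.0]
found = find_footprints(ratios)
center = footprint_center(found[0])
```
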
[0037] Optionally, primary arrays of probes can also contain second, third and fourth probe sets as described in WO
95/11995. The probes from the three additional probe sets are identical to a corresponding probe from the first probe
set except at the interrogation position, which occurs in the same position in each of the four corresponding probes
from the four probe sets, and is occupied by a different nucleotide in the four probe sets. After hybridization of such
an array to a labelled target sequence, analysis of the pattern of label reveals the nature and position of differences
between the target and reference sequence. For example, comparison of the intensities of four corresponding probes
reveals the identity of a corresponding nucleotide in the target sequence aligned with the interrogation position of the
probes. The corresponding nucleotide is the complement of the nucleotide occupying the interrogation position of the
probe showing the highest intensity.
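
The base-calling rule stated above can be written as a brief sketch. It assumes the four intensities are supplied as a dictionary keyed by the nucleotide at each probe's interrogation position; the function name and data layout are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(intensities):
    """Return the target base implied by four corresponding probes.

    `intensities` maps the interrogation-position base of each probe to its
    hybridization intensity. The called base is the complement of the
    interrogation base of the brightest probe.
    """
    if set(intensities) != set(COMPLEMENT):
        raise ValueError("expected one intensity for each of A, C, G, T")
    brightest = max(intensities, key=intensities.get)
    return COMPLEMENT[brightest]


called = call_base({"A": 0.1, "C": 0.9, "G": 0.2, "T": 0.1})
```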

[0038] Optionally, primary arrays tile both strands of reference sequences. Both strands are tiled separately using
the same principles described above, and the hybridization patterns of the two tilings are analyzed separately. Typically,
the hybridization patterns of the two strands indicate the same results (i.e., location and/or nature of polymorphic
form), increasing confidence in the analysis. Occasionally, there may be an apparent inconsistency between the
hybridization patterns of the two strands due to, for example, base-composition effects on hybridization intensities. Such
inconsistency signals the desirability of rechecking a target sample either by the same means or by some other
sequencing method, such as use of an ABI sequencer.

[0039] The secondary arrays used for analyzing previously identified polymorphisms typically differ from the primary
arrays in the following respects. First, whereas probes are typically included to span the entire length of a reference
sequence in primary arrays, in secondary arrays only a segment of a reference sequence containing a polymorphic
site and immediately flanking bases is typically spanned. For example, this segment is often of a
length commensurate with that of the probes. Second, a secondary array typically includes at least two groups of
probes. A first group of probes is designed based on the reference sequence, and the second group based on a
polymorphic form thereof. If there are three polymorphic forms at a given polymorphic site, a third group of probes can
be included. Finally, because fewer probes are generally required to analyze precharacterized polymorphisms than in
the de novo identification of polymorphisms, secondary arrays often are designed to detect more different polymorphic
sites than primary arrays. For example, a primary array typically detects 1-100 polymorphic sites in 1-100 reference
sequences. A secondary array can easily analyze 1,000, 10,000 or 100,000 polymorphic sites in reference sequences
dispersed throughout the human genome.

[0040] The design of suitable probe arrays for analysis of predetermined polymorphisms and interpretation of the
hybridization patterns is described in detail in WO 95/11995; EP 717,113; and WO 97/29212. Such arrays typically
contain first and second groups of probes which are designed to be complementary to different allelic forms of the
polymorphism. Each group contains a first set of probes, which is subdivided into subsets, one subset for each
polymorphism. Each subset contains probes that span a polymorphism and proximate bases and are complementary to
one allelic form of the polymorphism. Thus, within the first and second probe groups there are corresponding subsets
of probes for each polymorphism. The hybridization patterns of these probes to target samples can be analyzed by
footprinting or cluster analysis, as described above. For example, if the first and second probe groups contain subsets
of probes respectively complementary to first and second allelic forms of a polymorphic site spanned by the probes,
then on hybridization of the array to a sample that is homozygous for the first allelic form all probes in the subset from
the first group show specific hybridization, whereas probes in the subset from the second group that span the
polymorphism show only mismatch hybridization. The mismatch hybridization is manifested as a footprint of probe
intensities in a plot of normalized probe intensity (i.e., target/reference intensity ratio) for the subset of probes in the second
group. Conversely, if the target sample is homozygous for the second allelic form, a footprint is observed in the
normalized hybridization intensities of probes in the subset from the first probe group. If the target sample is heterozygous
for both allelic forms then a footprint is seen in normalized probe intensities from subsets in both probe groups although
the depression of intensity ratio within the footprint is less marked than in footprints observed with homozygous alleles.
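
The three hybridization outcomes described above can be summarized in a short sketch. It assumes that each subset is represented by a list of normalized (target/reference) intensity ratios and that a footprint is present whenever the lowest ratio in a subset falls below an assumed cutoff; the cutoff value and the function name are hypothetical, and a practical analysis would also weigh the depth of the footprint.

```python
def call_genotype(first_group_ratios, second_group_ratios, footprint_cutoff=0.8):
    """Infer the genotype at one polymorphic site from the two probe subsets.

    A footprint in a subset means the target mismatches that subset's allelic form.
    """
    footprint_in_first = min(first_group_ratios) < footprint_cutoff
    footprint_in_second = min(second_group_ratios) < footprint_cutoff
    if footprint_in_second and not footprint_in_first:
        return "homozygous first allele"
    if footprint_in_first and not footprint_in_second:
        return "homozygous second allele"
    if footprint_in_first and footprint_in_second:
        # Both subsets show a (shallower) footprint when both alleles are present.
        return "heterozygous"
    return "no call"


genotype = call_genotype([1.0, 0.95, 1.0], [1.0, 0.2, 1.0])
```
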
[0041] Alternatively, the first and second groups of probes can contain first, second, third and fourth probe sets. Each
of the probe sets can be subdivided into subsets, one for each polymorphism to be analyzed by the array. The first set
of probes in each group spans a polymorphic site and proximate bases and is complementary to one allelic form of
the site. The second, third and fourth sets each have a corresponding probe for each probe in the first probe set, which
is identical to a corresponding probe from the first probe set except at the interrogation position, which occurs in the
same position in each of the four corresponding probes from the four probe sets, and is occupied by a different
nucleotide in the four probe sets.

[0042] Such arrays are interpreted in similar manner to the primary arrays having four sets of probes described
above. For example, consider a secondary array having first and second groups of probes, each having four sets of
probes designed based on first and second allelic forms of a single polymorphic site, hybridized to a target containing
homozygous first allele. The probes from the first probe set of the first group all show perfect hybridization to the target
sample, and probes from other probe sets in the first group all show mismatch hybridization. All probes from the second
group of probes show at least one mismatch except one of the four corresponding probes having an interrogation
position aligned with the polymorphic site. A probe from the second, third or fourth probe sets having an
interrogation position occupied by a base that is the complement of the corresponding base in the first allelic form
shows specific hybridization.

[0043] If such an array is hybridized to a target sample containing homozygous second allelic form, the mirror image
hybridization pattern is observed. That is, all probes in the first probe set of the second group show matched
hybridization, and probes from the second, third and fourth probe sets in the second probe group show mismatch hybridization.
All but one probe in the first group of probes show mismatch hybridization. The one probe showing perfect hybridization
has an interrogation site aligned with the polymorphic site and occupied by the complement of the base occupying the
polymorphic site in the second allelic form.

[0044] If such an array is hybridized to a target sample containing heterozygous first and second allelic forms, the
aggregate of the above two hybridization patterns is observed. That is, all probes in the first probe set from both the
first and second group show perfect hybridization (albeit with reduced intensity relative to a homozygous target), and
one additional probe from the second, third or fourth probe set in each group shows perfect hybridization. In each
group, this probe has an interrogation position aligned with the polymorphic site and occupied by a base occupying
the polymorphic site in one or other of the allelic forms.
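
The interpretation of the four corresponding probes at an interrogation position can be sketched as follows. This Python fragment is an illustrative assumption, not the analysis procedure of the specification: the function names and the rule "call every probe whose signal is at least half of the strongest signal" are hypothetical. The input is the hybridization intensity of the four probes whose interrogation position is aligned with the polymorphic site, keyed by the probe base at that position; the target base is reported as the complement of each strongly hybridizing probe base.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def called_target_bases(intensities, min_fraction=0.5):
    """Return the sorted target bases inferred from the four probes.

    intensities: dict mapping probe base at the interrogation position
                 ("A", "C", "G", "T") to hybridization intensity.
    min_fraction: hypothetical cutoff relative to the strongest probe.
    """
    top = max(intensities.values())
    if top <= 0:
        return []
    return sorted(
        COMPLEMENT[base]
        for base, signal in intensities.items()
        if signal >= min_fraction * top
    )


def zygosity(intensities, min_fraction=0.5):
    """One strong probe suggests a homozygote, two suggest a heterozygote."""
    bases = called_target_bases(intensities, min_fraction)
    if len(bases) == 1:
        return "homozygous " + bases[0]
    if len(bases) == 2:
        return "heterozygous " + "/".join(bases)
    return "no call"
```

A single strong probe corresponds to the homozygous patterns of paragraphs [0042] and [0043]; two strong probes correspond to the aggregate pattern of paragraph [0044].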

[0045] Typically, secondary arrays contain multiple subsets of each of the probe sets described, with a separate
subset for each polymorphism. Thus, for example, a secondary array for analyzing a thousand polymorphisms might
contain first and second groups of probes, each containing four probe sets, with each of the four probe sets being
divided into 1000 subsets corresponding to the 1000 different polymorphisms. In this situation, analysis of the
hybridization patterns from four subsets relating to any given polymorphism is independent of any other polymorphism.
[0046] Analysis of the hybridization pattern of a secondary array to a target sample indicates which polymorphic form
is present at some or all of the polymorphic sites represented on the array. Thus, the individual is characterized with a
polymorphic profile representing allelic variants present at a substantial collection of polymorphic sites.

6. Synthesis and Scanning of Probe Arrays 

[0047] Arrays of probes immobilized on supports can be synthesized by various methods. A preferred method is
VLSIPS™ (see Fodor et al., 1991; Fodor et al., 1993, Nature 364, 555-556; McGall et al., USSN 08/445,332; US
5,143,854; EP 476,014), which entails the use of light to direct the synthesis of oligonucleotide probes in high-density,
miniaturized arrays. Algorithms for design of masks to reduce the number of synthesis cycles are described by Hubbell
et al., US 5,571,639 and US 5,593,839. Arrays can also be synthesized in a combinatorial fashion by delivering
monomers to cells of a support by mechanically constrained flowpaths. See Winkler et al., EP 624,059. Arrays can also be
synthesized by spotting monomer reagents onto a support using an ink jet printer. See id.; Pease et al., EP 728,520.
[0048] After hybridization of control and target samples to an array containing one or more probe sets as described
above and optional washing to remove unbound and nonspecifically bound probe, the hybridization intensity for the
respective samples is determined for each probe in the array. For fluorescent labels, hybridization intensity can be
determined by, for example, a scanning confocal microscope in photon counting mode. Appropriate scanning devices
are described by, e.g., Trulson et al., US 5,578,832; Stern et al., US 5,631,734.

7. Variations

(a) Tertiary Arrays for Analysis of RNA

[0049] If a secondary array of probes representing a large collection of polymorphic sites is hybridized to an
unamplified RNA target sample, then only probes spanning polymorphic sites present in highly expressed RNAs typically
hybridize to a detectable extent. A tertiary array is then produced for future use containing only the subsets of probes
spanning polymorphic sites represented in highly expressed RNA transcripts. Such an array can be used for allelic
profiling without the need to amplify nucleic acids in the target sample.

(b) Distinguishing sequencing errors from polymorphisms in published sequences

[0050] A variation on the previously described methods for de novo identification of polymorphisms starts by
comparison of two published versions of the same genetic sequences. Frequently, published versions from independent
sources show divergence at one or more sites and it is not clear whether the divergence results from sequencing error
or is the result of allelic variation. These possibilities can be distinguished by using probe arrays spanning and
complementary to one or both of the reported sequences about the site of potential variation. Arrays of either the primary
or secondary type described above can be used. The probe arrays are hybridized to target samples from a collection
of individuals. The target samples are typically prepared by amplification of nucleic acids using primers flanking a
fragment including the site of potential variation between published sequences. If the probe arrays show the same
hybridization pattern to each of the target samples, it is concluded that the apparent divergence between the sequences
is probably due to sequencing error. If the probe arrays show at least two different hybridization patterns to two different
target samples from different individuals, then the site of divergence is confirmed as a bona fide polymorphic site.
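
The decision rule in this paragraph reduces to asking whether more than one distinct hybridization pattern is seen across individuals. A minimal Python sketch follows; the function name and the representation of each individual's pattern as a hashable label (for example, a called base or genotype string) are assumptions made for illustration.

```python
def classify_divergence(patterns):
    """Classify a site at which two published sequences disagree.

    patterns: one hybridization-pattern label per individual tested.
    Returns "polymorphism" if at least two different patterns occur,
    otherwise "probable sequencing error".
    """
    if len(patterns) < 2:
        raise ValueError("at least two individuals are needed for a comparison")
    if len(set(patterns)) >= 2:
        return "polymorphism"
    return "probable sequencing error"
```

The confidence of a "probable sequencing error" result depends on how many individuals are sampled, since a rare allele can be missed in a small collection.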
[0051] In some instances, the original data from which a published sequence was derived is also published or
otherwise publicly available. Such data may exist as a sequencing gel ladder or as automatic sequencer trace profiles.
Where at least two sources of independent data are available for what purports to be the same sequence, direct
comparison of the original data may indicate potential points of divergence that are not represented in the published
sequences. Such points of potential divergence are then confirmed or otherwise by hybridizing multiple target samples
to suitable arrays as described above.



8. Uses of Polymorphic Profiles

[0052] After determining a polymorphic profile of an individual or population of individuals, this information can be
used in a number of methods.

a. Association Studies and Diagnosis



[0053] The polymorphic profile of an individual may contribute to phenotype of the individual in different ways. Some
polymorphisms occur within a protein coding sequence and contribute to phenotype by affecting protein structure. The
effect may be neutral, beneficial or detrimental, or both beneficial and detrimental, depending on the circumstances.
For example, a heterozygous sickle cell mutation confers resistance to malaria, but a homozygous sickle cell mutation
is usually lethal. Other polymorphisms occur in noncoding regions but may exert phenotypic effects indirectly via
influence on replication, transcription, and translation. A single polymorphism may affect more than one phenotypic trait.
Likewise, a single phenotypic trait may be affected by polymorphisms in different genes. Further, some polymorphisms
predispose an individual to a distinct mutation that is causally related to a certain phenotype.
[0054] Phenotypic traits include diseases that have known but hitherto unmapped genetic components (e.g.,
agammaglobulinemia, diabetes insipidus, Lesch-Nyhan syndrome, muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's
disease, familial hypercholesterolemia, polycystic kidney disease, hereditary spherocytosis, von Willebrand's disease,
tuberous sclerosis, hereditary hemorrhagic telangiectasia, familial colonic polyposis, Ehlers-Danlos syndrome,
osteogenesis imperfecta, and acute intermittent porphyria). Phenotypic traits also include symptoms of, or susceptibility to,
multifactorial diseases of which a component is, or may be, genetic, such as autoimmune diseases, inflammation,
cancer, diseases of the nervous system, and infection by pathogenic microorganisms. Some examples of autoimmune
diseases include rheumatoid arthritis, multiple sclerosis, diabetes (insulin-dependent and non-insulin-dependent), systemic
lupus erythematosus and Graves disease. Some examples of cancers include cancers of the bladder, brain, breast,
colon, esophagus, kidney, leukemia, liver, lung, oral cavity, ovary, pancreas, prostate, skin, stomach and uterus.
Phenotypic traits also include characteristics such as longevity, appearance (e.g., baldness, obesity), strength, speed,
endurance, fertility, and susceptibility or receptivity to particular drugs or therapeutic treatments.
[0055] Correlation is performed for a population of individuals who have been tested for the presence or absence of
one or more phenotypic traits of interest and for polymorphic profile. The alleles of each polymorphism in the profile
are then reviewed to determine whether the presence or absence of a particular allele is associated with the trait of
interest. Correlation can be performed by standard statistical methods such as a chi-squared test, and statistically
significant correlations between polymorphic form(s) and phenotypic characteristics are noted. For example, it might be
found that the presence of allele A1 at polymorphism A correlates with heart disease. As a further example, it might
be found that the combined presence of allele A1 at polymorphism A and allele B1 at polymorphism B correlates with
increased risk of cancer.
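
As one way to make the chi-squared correlation concrete, the sketch below computes a Pearson chi-squared statistic (without continuity correction) for a 2x2 table of allele presence against trait presence. It is an illustrative assumption rather than the specification's procedure: the function name and the table layout are hypothetical, and the p-value uses the one-degree-of-freedom identity p = erfc(sqrt(chi2/2)).

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for a 2x2 contingency table.

    a: allele present, trait present    b: allele present, trait absent
    c: allele absent,  trait present    d: allele absent,  trait absent
    Returns (chi2, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

When many polymorphisms are tested against the same trait, some correction for multiple comparisons would normally be applied before a correlation is treated as significant; that step is outside this sketch.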

[0056] Such correlations can be exploited in several ways. In the case of a strong correlation between a set of one
or more polymorphic forms and a disease for which treatment is available, detection of the polymorphic form set in a
human or animal patient may justify immediate administration of treatment, or at least the institution of regular
monitoring of the patient. Detection of a polymorphic form(s) correlated with serious disease in a couple contemplating a
family may also be valuable to the couple in their reproductive decisions. For example, the female partner might elect
to undergo in vitro fertilization to avoid the possibility of transmitting such a polymorphism from her husband to her
offspring. In the case of a weaker, but still statistically significant correlation between a polymorphic set and human
disease, immediate therapeutic intervention or monitoring may not be justified. Nevertheless, the patient can be
motivated to begin simple life-style changes (e.g., diet, exercise) that can be accomplished at little cost to the patient but
confer potential benefits in reducing the risk of conditions to which the patient may have increased susceptibility by
virtue of variant alleles. Identification of a polymorphic profile in a patient correlated with enhanced receptiveness to
one of several treatment regimes for a disease indicates that this treatment regime should be followed.
[0057] For animals and plants, correlations between polymorphic profiles and phenotype are useful for breeding for
desired characteristics. For example, Beitz et al., US 5,292,639 discuss use of bovine mitochondrial polymorphisms
in a breeding program to improve milk production in cows. To evaluate the effect of mtDNA D-loop sequence
polymorphism on milk production, each cow was assigned a value of 1 if variant or 0 if wildtype with respect to a prototypical
mitochondrial DNA sequence at each of 17 locations considered. Each production trait was analyzed individually with
the following animal model:

Y_ijkpn = μ + YS_i + X_j + β_1 + β_2 + ... + β_17 + PE_n + a_n + e_p

where Y_ijkpn is the milk, fat, fat percentage, SNF, SNF percentage, energy concentration, or lactation energy
record; μ is an overall mean; YS_i is the effect common to all cows calving in year-season; X_j is the effect common
to cows in either the high or average selection line; β_1 to β_17 are the binomial regressions of production record on
mtDNA D-loop sequence polymorphisms; PE_n is permanent environmental effect common to all records of cow n;
a_n is effect of animal n and is composed of the additive genetic contribution of sire and dam breeding values and a
Mendelian sampling effect; and e_p is a random residual. It was found that eleven
of seventeen polymorphisms tested influenced at least one production trait. Bovines having the best polymorphic forms
for milk production at these eleven loci are used as parents for breeding the next generation of the herd.

b. Forensics 

[0058] Determination of which polymorphic forms occupy a set of polymorphic sites in an individual identifies a set
of polymorphic forms that distinguishes the individual. See generally National Research Council, The Evaluation of
Forensic DNA Evidence (Eds. Pollard et al., National Academy Press, DC, 1996). The more sites that are analyzed,
the lower the probability that the set of polymorphic forms in one individual is the same as that in an unrelated individual.
[0059] The capacity to identify a distinguishing or unique set of forensic markers in an individual is useful for forensic
analysis. For example, one can determine whether a blood sample from a suspect matches a blood or other tissue
sample from a crime scene by determining whether the set of polymorphic forms occupying selected polymorphic sites
is the same in the suspect and the sample. If the set of polymorphic markers does not match between a suspect and
a sample, it can be concluded (barring experimental error) that the suspect was not the source of the sample. If the
set of markers does match, one can conclude that the DNA from the suspect is consistent with that found at the crime
scene. If frequencies of the polymorphic forms at the loci tested have been determined (e.g., by analysis of a suitable
population of individuals), one can perform a statistical analysis to determine the probability that a match of suspect
and crime scene sample would occur by chance.

[0060] p(ID) is the probability that two random individuals have the same polymorphic or allelic form at a given
polymorphic site. In diallelic loci, four genotypes are possible: AA, AB, BA, and BB. If alleles A and B occur in a haploid
genome of the organism with frequencies x and y, the probabilities of each genotype in a diploid organism are (see WO
95/12607):

Homozygote: p(AA) = x^2

Homozygote: p(BB) = y^2 = (1-x)^2

Single Heterozygote: p(AB) = p(BA) = xy = x(1-x)

Both Heterozygotes: p(AB+BA) = 2xy = 2x(1-x)

[0061] The probability of identity at one locus (i.e., the probability that two individuals, picked at random from a
population, will have identical polymorphic forms at a given locus) is given by the equation:

p(ID) = (x^2)^2 + (2xy)^2 + (y^2)^2

[0062] These calculations can be extended for any number of polymorphic forms at a given locus. For example, the
probability of identity p(ID) for a 3-allele system where the alleles have the frequencies in the population of x, y and z,
respectively, is equal to the sum of the squares of the genotype frequencies:

p(ID) = x^4 + (2xy)^2 + (2yz)^2 + (2xz)^2 + z^4 + y^4
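
The sum of squared genotype frequencies generalizes directly to any number of alleles. The Python sketch below is an illustration under the assumption that genotype frequencies follow from the allele frequencies as in paragraph [0060] (homozygotes f^2, combined heterozygotes 2fg); the function name is hypothetical.

```python
from itertools import combinations


def probability_of_identity(allele_freqs):
    """p(ID) at one locus: sum of squared genotype frequencies.

    allele_freqs: population frequencies of the alleles; must sum to 1.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Homozygous genotypes have frequency f^2, contributing (f^2)^2.
    homozygous = sum((f * f) ** 2 for f in allele_freqs)
    # Each unordered heterozygous pair has frequency 2fg, contributing (2fg)^2.
    heterozygous = sum((2 * f * g) ** 2 for f, g in combinations(allele_freqs, 2))
    return homozygous + heterozygous
```

For two equally frequent alleles this gives 0.0625 + 0.25 + 0.0625 = 0.375, in agreement with the diallelic equation of paragraph [0061].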

[0063] In a locus of n alleles, the appropriate binomial expansion is used to calculate p(ID) and p(exc).
[0064] The cumulative probability of identity (cum p(ID)) for each of multiple unlinked loci is determined by multiplying
the probabilities provided by each locus.

cum p(ID) = p(ID1)p(ID2)p(ID3)...p(IDn)

[0065] The cumulative probability of non-identity for n loci (i.e., the probability that two random individuals will be
different at 1 or more loci) is given by the equation:

cum p(nonID) = 1 - cum p(ID).

[0066] If several polymorphic loci are tested, the cumulative probability of non-identity for random individuals
becomes very high (e.g., one billion to one). Such probabilities can be taken into account together with other evidence
in determining the guilt or innocence of the suspect.
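
The cumulative calculation can be sketched in a few lines of Python. The function names are hypothetical, and `loci_needed` makes the simplifying assumption, introduced only for this example, that every locus has the same p(ID).

```python
import math


def cumulative_identity(p_ids):
    """cum p(ID): product of the per-locus probabilities of identity
    (valid for unlinked loci)."""
    return math.prod(p_ids)


def cumulative_non_identity(p_ids):
    """cum p(nonID) = 1 - cum p(ID)."""
    return 1.0 - cumulative_identity(p_ids)


def loci_needed(p_id, target):
    """Smallest number of loci, each with the same p(ID), for which
    cum p(ID) falls to `target` or below."""
    if not (0 < p_id < 1 and 0 < target < 1):
        raise ValueError("p_id and target must lie strictly between 0 and 1")
    return math.ceil(math.log(target) / math.log(p_id))
```

For example, with diallelic loci whose alleles are equally frequent (p(ID) = 0.375 per locus), `loci_needed(0.375, 1e-9)` indicates how many unlinked loci bring the chance of a coincidental match below one in a billion.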

c. Paternity Testing

[0067] The object of paternity testing is usually to determine whether a male is the father of a child. In most cases,
the mother of the child is known and thus, the mother's contribution to the child's genotype can be traced. Paternity
testing investigates whether the part of the child's genotype not attributable to the mother is consistent with that of the
putative father. Paternity testing can be performed by analyzing sets of polymorphisms in the putative father and the
child.

[0068] If the set of polymorphisms in the child attributable to the father does not match the putative father, it can be
concluded, barring experimental error, that the putative father is not the real father. If the set of polymorphisms in the
child attributable to the father does match the set of polymorphisms of the putative father, a statistical calculation can
be performed to determine the probability of coincidental match.
[0069] The probability of parentage exclusion (representing the probability that a random male will have a
polymorphic form at a given polymorphic site that makes him incompatible as the father) is given by the equation (see WO
95/12607):

p(exc) = xy(1-xy)

where x and y are the population frequencies of alleles A and B of a diallelic polymorphic site.
[0070] (At a triallelic site, p(exc) = xy(1-xy) + yz(1-yz) + xz(1-xz) + 3xyz(1-xyz), where x, y and z are the respective
population frequencies of alleles A, B and C.)
[0071] The probability of non-exclusion is

p(non-exc) = 1 - p(exc)

[0072] The cumulative probability of non-exclusion (representing the value obtained when n loci are used) is thus:

cum p(non-exc) = p(non-exc1)p(non-exc2)p(non-exc3)...p(non-excn)

[0073] The cumulative probability of exclusion for n loci (representing the probability that a random male will be
excluded) is:

cum p(exc) = 1 - cum p(non-exc).
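
The exclusion formulas for diallelic sites can be combined into a short Python sketch. The function names are hypothetical; the per-locus formula is the diallelic equation of paragraph [0069], and the loci are assumed to be unlinked so that non-exclusion probabilities multiply.

```python
import math


def p_exclusion_diallelic(x):
    """p(exc) = xy(1 - xy) for a diallelic site, where x is the frequency
    of allele A and y = 1 - x is the frequency of allele B."""
    if not 0 <= x <= 1:
        raise ValueError("allele frequency must lie between 0 and 1")
    y = 1 - x
    return x * y * (1 - x * y)


def cumulative_exclusion(per_locus_exclusion):
    """cum p(exc) = 1 - product over loci of (1 - p(exc))."""
    return 1 - math.prod(1 - p for p in per_locus_exclusion)
```

With equally frequent alleles, a single diallelic locus excludes a random male with probability 0.1875, so many loci are needed before the cumulative probability of exclusion approaches 1.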

[0074] If several polymorphic loci are included in the analysis, the cumulative probability of exclusion of a random
male is very high. This probability can be taken into account in assessing the liability of a putative father whose
polymorphic marker set matches the child's polymorphic marker set attributable to his/her father.
[0075] All publications and patent applications cited above are incorporated by reference in their entirety for all
purposes to the same extent as if each individual publication or patent application were specifically and individually
indicated to be so incorporated by reference. Although the present invention has been described in some detail by way
of illustration and example for purposes of clarity and understanding, it will be apparent that certain changes and
modifications may be practiced within the scope of the appended claims.

Claims

1. A method of polymorphism analysis, comprising: 

(a) constructing a first array of probes spanning and complementary to one or more known DNA sequences;

(b) hybridizing the first array of probes with nucleic acid samples from different individuals, whereby differences
in the hybridization pattern of the samples to the probes among the different individuals indicate the location
of one or more polymorphic sites in the one or more DNA sequences;

(c) repeating (a) and (b) as needed until a collection of polymorphic sites in known DNA sequences has been
identified;

(d) constructing a second array of probes comprising a first set of probes spanning each of the polymorphic
sites in the collection and complementary to polymorphic forms present in the known sequences, and a second
set of probes spanning each of the polymorphic sites in the collection and complementary to polymorphic
forms absent in the known DNA sequences;

(e) hybridizing the second array of probes to a nucleic acid sample from a further individual, and analyzing
the hybridization intensities of probes in the first and second sets of probes to determine a profile of polymorphic
forms present in the further individual.

2. The method of claim 1, wherein the further individual in step (e) has a known characteristic whose presence or
absence is unknown in the different individuals in step (b).

3. The method of claim 1, further comprising retrieving the known DNA sequences from a computer database whereby
the probes spanning the known DNA sequences are selected.

4. The method of claim 3, wherein the known DNA sequences are of unknown function.

5. The method of claim 3, wherein the known DNA sequences are expressed sequence tags.

6. The method of claim 2, wherein the known characteristic is the presence of a disease.

7. The method of claim 2, wherein the known characteristic is the absence of a disease.

8. The method of claim 2, wherein the known characteristic is being suspected of perpetrating a crime.

9. The method of claim 1, wherein the known DNA sequences comprise at least 100 sequences.

10. The method of claim 9, wherein the known sequences occur on at least 2 chromosomes. 

11. The method of claim 9, wherein the known sequences occur on each of the 23 human chromosomes. 

12. The method of claim 9, wherein at least 10 of the known sequences are unlinked.

13. The method of claim 1, wherein the hybridization pattern of a sample from one of the individuals shows a footprint
in which probes spanning a polymorphism hybridize with reduced hybridization intensity relative to the hybridization
pattern of a sample from another individual.

14. The method of claim 1, wherein there are at least 10 different individuals in step (b).

15. The method of claim 1, wherein the further nucleic acid sample is RNA or cDNA.

16. The method of claim 15, further comprising determining from the hybridization intensities of probes in the second
array a subset of polymorphic sites in a subset of known sequences which are expressed in the further nucleic
acid sample.

17. The method of claim 16, further comprising:

(f) constructing a third array of probes comprising a first subset of probes spanning each of the subset of
polymorphic sites in the collection and complementary to polymorphic forms present in the subset of known
sequences, and a second set of probes spanning each of the subset of polymorphic sites in the collection and
complementary to polymorphic forms absent in the subset of known DNA sequences; and

(g) hybridizing the third array of probes to an RNA sample from a third individual, and analyzing the hybridization
intensities of probes in the first and second sets of probes to determine a profile of polymorphic forms present
in the third individual.

18. The method of claim 17, wherein the RNA sample is obtained without amplification. 

19. A method of polymorphism analysis, comprising: 

retrieving multiple versions of a nucleic acid sequence from a published source;

comparing the multiple versions to identify point(s) of diversion;

designing an array comprising probes spanning and complementary to part(s) of the nucleic acid sequence
spanning the point(s) of diversion; and

hybridizing nucleic acid samples from multiple individuals to the array, whereby the existence of difference(s)
in the hybridization intensity of probes spanning a point of diversion among the individuals indicates a
polymorphism at the point of diversion, and lack of difference in hybridization intensity of probes spanning a point
of diversion among the individuals indicates the point of diversion was due to a sequencing error.

20. The method of claim 19, wherein the published source is a computer database.

21. The method of claim 19, wherein the multiple versions of the nucleic acid sequence are retrieved as trace profiles.

22. A method of polymorphism analysis comprising: 

providing an immobilized array of probes comprising a first set of probes spanning each of a collection of
polymorphic sites in known sequences and complementary to first allelic forms of the sites, and a second
set of probes spanning each of the polymorphic sites in the collection and complementary to second allelic
forms of the sites, wherein the collection of polymorphic sites includes at least 10 unlinked polymorphic sites;

hybridizing a nucleic acid sample from an individual to the array of probes and analyzing the hybridization
intensities of probes in the first and second probe sets to determine a profile of polymorphic forms present in
the individual.

23. The method of claim 22, wherein the collection of polymorphic sites includes a polymorphic site on each of the 23
human chromosomes.

24. The method of claim 22, wherein the collection of polymorphic sites includes at least 100 polymorphic sites. 

25. The method of claim 22, wherein the hybridizing step is repeated for nucleic acid samples from a population of
individuals, each of whom is characterized for the presence or absence of a phenotype, to determine a profile of
polymorphic forms present in each individual in the population, and the method further comprises correlating
profiles of polymorphic forms with the presence or absence of the phenotype in the population.

26. The method of claim 22, wherein the nucleic acid sample is RNA or cDNA. 

27. The method of claim 26, further comprising enriching for transcripts of the known sequences in the nucleic acid
sample by contacting the nucleic acid sample with probes complementary to the transcripts, whereby the probes
hybridize to the transcripts to form complexes, isolating the complexes and dissociating the transcripts of the known
sequences from the probes.

28. The method of claim 22, wherein the nucleic acid sample is a genome DNA sample.

29. The method of claim 22, wherein the nucleic acid sample is prepared by amplification with random primers.